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Abstract 

Pathway analysis approaches for sequence data typically either operate in a single stage (all variants within all 
genes in the pathway are combined into a single, very large set of variants that can then be analyzed using 
standard "gene-based" test statistics) or in 2-stages (gene-based p values are computed for all genes in the 
pathway, and then the gene-based p values are combined into a single pathway p value). To date, little 
consideration has been given to the performance of gene-based tests (typically designed for a smaller number of 
single-nucleotide variants [SNVs]) when the number of SNVs in the gene or in the pathway is very large and the 
genotypes come from sequence data organized in large pedigrees. We consider recently proposed gene-based 
tests for rare variants from complex pedigrees that test for association between a large set of SNVs and a 
qualitative phenotype of interest (1 -stage analyses) as well as 2-stage approaches. We find that many of these 
methods show inflated type I errors when the number of SNVs in the gene or the pathway is large (>200 SNVs) 
and when using standard approaches to estimate the genotype covariance matrix. Alternative methods are needed 
when testing very large sets of SNVs in 1 -stage approaches. 



Background 

Until recently, the majority of methodological approaches 
for the analysis of common, single-nucleotide variants 
(SNVs; common SNVs have minor allele frequency of at 
least 5%) involved analysis of microarrays using either 
single-marker or multiple-marker approaches, with single- 
marker approaches, by far the more common of the 2 
approaches in practice. Typically, large genome-wide asso- 
ciation studies will conduct hundreds of thousands (or 
millions) of single-marker analyses. Although less com- 
mon in practice, considerable methodological develop- 
ment has taken place in the area of multiple-marker 
analysis in which signals from multiple common SNVs are 
aggregated into a single test of association for the set of 
SNVs of interest (see Refs. [1,2] for recent reviews). 

Of particular interest in the development of multiple- 
marker tests is the blurring of the lines that has recently 
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taken place between what were traditionally considered 
"gene-based" tests of association and "pathway-based" 
tests of association. Historically, gene-based tests of asso- 
ciation aggregated a small number of SNVs within a gene 
into a single test statistic, whereas pathway-based tests 
operated in 2 stages [1,2]. In the traditional approach to 
pathway analysis, researchers first generate a statistic for 
each gene. In a second stage, researchers combine multi- 
ple gene-level statistics into a pathway statistic. Recently, 
however, a single-stage approach was advocated whereby, 
in a single stage, all the SNVs in the pathway all simulta- 
neously aggregated into a single statistic [1]. Effectively, 
this single-stage approach could be considered a "SNV- 
set" approach where the set of SNVs can be defined in 
any biologically plausible manner. Although this rela- 
tively new approach is gaining in popularity in the litera- 
ture, little concrete evidence of its performance relative 
to the 2-stage approach is available. It has been sug- 
gested, however, that the 2-stage approach may be opti- 
mal when there are fewer causal variants in each gene, 
but each with strong risk, and the single-stage approach 
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may be better suited for cases where there are more cau- 
sal variants, but with weaker effects [3-5]. 

With the rapid growth in access to next-generation 
sequencing (NGS) technology, there has been a tidal 
wave of methodological developments to analyze such 
data. At the heart of most analytic approaches for NGS 
data is the desire to appropriately handle rare variants. 
Because statistical tests of individual rare variants lack 
power and multiple testing penalties quickly become 
unwieldy because of the preponderance of rare variants, 
most recently proposed NGS data analysis approaches 
attempt to aggregate multiple SNV association signals 
so as to increase power. These rare SNV-set methods 
can be viewed as a special type of multiple-marker test 
that is particularly useful when SNVs are rare. For a 
broad overview and classification of these rare variant 
approaches for case-control studies see Liu et al [6]. 

There are few family-based tests of rare variant associa- 
tion that are appropriate for complex pedigrees analyzing 
dichotomous phenotypes for association with rare var- 
iants (see Mainline et al [7] for more extensive discus- 
sion). In this article, we apply several of these recently 
proposed approaches to very large sets of SNVs; the very 
large sets are created by combining SNVs from multiple 
genes into pathway based sets of SNVs. The goal of our 
analysis is to evaluate the appropriateness of these meth- 
ods for use with massive SNV sets when the proportion 
of noncausal variants may be very large, but the set may 
also contain multiple causal variants. We evaluate both 
single-stage and traditional 2-stage approaches, in which 
stage 1 computes a test statistic for each gene in the 
pathway, and stage 2 combines the gene-based statistics 
into a single statistic for the pathway. 

Methods 

Sample and genes 

There were 849 individuals in complex pedigrees with 
simulated phenotype data available. We classified each of 
the 849 individuals as either hypertensive (systolic blood 
pressure >140 mm Hg or diastolic blood pressure >90 mm 
Hg) or not hypertensive based on whether the individual 
was classified as hypertensive at any of up to 4 measure- 
ments (waves). For our analysis of the simulated data we 
focused on simulated phenotype 1. We had knowledge of 
the simulation answers for our analysis. SNVs were 
mapped to genes using a custom version of ANNOVAR, 
where a SNV is assigned to a gene if its physical location 
is within the start-stop position of the gene [8]. 

Creation of gene sets 

To evaluate the performance of different approaches to 
the analysis of very large sets of SNVs from multiple 
genes (eg, pathways) we created 800 sets of 5 genes 
each, where 200 of the sets contained no genes with 



causal variants, 200 of the sets contained 1 gene with 
causal variants, 200 of the sets contained 3 genes with 
causal variants, and 200 of the sets had all 5 genes con- 
taining causal variants. Sets were created by randomly 
choosing genes, without replacement, from lists of genes 
that were known (based on the simulation model) to 
contain or not contain causal SNVs. The average num- 
ber of SNVs in the 800 sets was 880 (SD = 468; mini- 
mum = 58; maximum = 2451). 

Previously proposed statistical tests 

We applied 5 different family-based tests of association 

considered by Zhu and Xiong [9] to sets of SNVs. All 
tests were conducted in R using software functions writ- 
ten by Zhu and Xiong and custom scripts. More details 
on the tests are available elsewhere [7,9]. 

The methods considered by Zhu and Xiong utilize a cor- 
rection factor, Pcom which summarizes the additional cor- 
relation in the samples that occurs as a result of the 
complex pedigree structure. P^orr is a function of the esti- 
mated kinship matrix (see Hainline et al [7] for details) 
and is used to adjust the standard error of the test statis- 
tics for the additional correlation contained in the pedigree 
structure. 

Zhu and Xiong [9] propose a generalized multivariate 
test comparing the mean allele counts across n var- 
iants (eg, SNVs within a gene) between the cases and 
controls; Hotelling's test is a multivariate version of 
the 2-sample t-test. Alternatively, Zhu and Xiong also 
consider a version that collapses rare variants below a 
threshold before applying the test (combined multi- 
variate and collapsing [CMC]) or uses eigenvectors from 
the genotype matrix to reduce matrix dimensionality 
(functional principal component analysis [FPCA]; see 
Hainline et al [7] for details). In our implementation of 
CMC we used minor allele frequency cutoffs of 5% and 
0.5%. Briefly stated, in all cases, is computed as if 
there was no pedigree structure in the data (Tj^jtj^;). The 
pedigree-adjusted statistic is computed as Tf^mai/pcorr- 
Finally, Zhu and Xiong investigate an approach that 
applies a single-marker test (pedigree-adjusted single- 
marker /2 test of association) to all SNVs in a set, and 
then uses the minimum p value within the set, as the p 
value for the entire set. Thus, there are 5 test statistics 
in total: 7^, CMCs^, CMCq.so^, FPCA and Xmin' where we 
refer to all methods except Xmin ^ based approaches, 
because they all rely on the Hotelling's 7^ statistic. 

Application of the tests to both 1 -stage and 2-stage 
pathway analysis 

In our analysis, we applied the 5 methods described 
above in 2 different ways. First, we directly applied each 
of the 5 methods to all 800 sets of SNVs (1 -stage pathway 
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analysis). All 1-stage pathway analysis tests are of the null 
hypothesis that no SNVs in the set are associated with 
the phenotype, with an alternative hypothesis that at least 
1 SNV in the set is associated with the phenotype. 

Second, we applied each of the 5 methods to all 5 indi- 
vidual genes contained within each of the 800 sets. We 
then combined the 5 gene p values for each test-set com- 
bination using Fisher's combined probability test (2-stage 
pathway analysis). Fisher's combined probability test is 

defined as Fisherp = ^ — 2log(pi) where />, is the 

p value from the /"^ gene in the set and Fisherp ~ Xw 
yielding 1 Fisher's combined probability test p value for 
each of T^, CMC5%, CMCo.5% , FPCA, and Xmin> 
described in the previous section. Fisher's combined 
probability test is one of many choices for 2-stage path- 
way analysis [5], but has the convenient advantage of 
having a known null distribution when the p values being 
combined are independent (in our case, this means that 
there is no linkage disequilibrium between the randomly 
chosen genes in the sets, which is a reasonable assump- 
tion). All 2-stage analyses have a null hypothesis that 
none of the genes in the set of genes contain any SNVs 
associated with the phenotype, with an alternative 
hypothesis that at least 1 gene in the set of genes contains 
at least one SNV associated with the phenotype. 

Modified implementation of the Zhu and Xiong methods 

When the Zhu and Xiong approaches are applied to the 
800 sets of SNVs, a dramatic inflation of the type I error 
rate is observed. In particular, among the 5 single-stage 
approaches, type I error rates ranged from 43.5% to 
99.5%, and among the 5 two-stage approaches, type I 
error rates ranged from 7.5% to 90.5%. Although we do 
not provide detailed results here, we note that the magni- 
tude of the type I error rate increased as the number of 
SNVs in the set increased; in particular, the inflated type 
I error rates only occurred on sets containing more than 
200 SNVs. We determined that a potential cause of the 
inflated type I error rate was in the approach taken when 
estimating the SNV genotype covariance matrix [10]. 
When analyzing large sets (>200) of SNVs and given the 
relatively small sample size (n = 849), prior research 
shows that the use of the maximum likelihood estimates 
(MLEs) in covariance matrix estimation may yield 
unstable results, and that a shrinkage covariance estima- 
tor may perform better [10]. 

We modified the covariance estimation procedure for 
all methods, replacing the MLEs with the shrinkage esti- 
mator. Although we lost the guarantee of the analytic 
null distribution derived by Zhu and Xiong, we explored 
the use of the shrinkage estimator using the null distri- 
bution in Zhu and Xiong. We found that this approach 
provided increasingly overconservative results as the set 



size increased (detailed results not shown). Nonetheless, 
we still expect that large values of the test statistic 
should lead to rejection of the null hypothesis. 

To account for the effect of set size on the behavior of 
the statistic, we estimated the empirical cumulative distri- 
bution function of the p values for the "null" sets (sets 
containing no causal SNVs) separately as a function of the 
number of SNVs in the set for each of the test statistics. 

In particular, Fm,n{t) = \ T^'' 1 \pi,m < A' where Pi^„, 
k ^—'1=1 

/ = l...k, are the p values for test statistic m (eg, 7^, CMC) 
when the set size is «. To obtain robust estimates of F(t), 
we binned sets of similar set size; there were 5 bins in 
total, representing quintile breaks in the null set size distri- 
bution. To provide appropriate control of the type I error 
rate and allow evaluation of the different methods, we 
computed a modified p value, p{modified)im = Pm,n{pi,m]> 
for all null and nonnull (contain at least 1 causal SNV) 
sets. These modified p values necessarily control the type I 
error rate for null sets, allowing us to get a sense of the 
performance of the different methods if we had an appro- 
priate null distribution for the new test statistics. Although 
in practice we do not have knowledge of the null set of 
pathways, this serves as an exploratory analysis to motivate 
derivation of a closed form null distribution for the modi- 
fied Zhu and Xiong statistic. 

Results 

We applied each of the 10 tests described earlier to all 
800 sets of genes. Table 1 illustrates the percent of sig- 
nificant sets (a = 0.05) across the 800 sets, where we 
stratify the 800 sets into 200 sets containing varying 
numbers of causal and noncausal genes. 

By design, sets containing no causal genes have 
empirical type I error rates of approximately 5%. The 
reason for some deviation from 5% in Table 1 is because 
of a combination of estimating the empirical null distri- 
bution based on 200 null sets and the binning procedure 
used (see Methods). In general, modest increases in the 
percent of significant sets are observed as more causal 
genes (genes containing at least 1 causal SNV) are 
added to the set for single-stage and 2-stage methods 
based on FPCA and Xmin' single-stage CMCo.5%, while 
other methods showed little difference in the percent of 
significant null sets and nonnull sets. 

Discussion 

Few analyses have considered methods to analyze path- 
ways (sets of genes; large sets of SNVs) for association 
with rare variants in family studies. Previous research 
showed that, depending on the underlying genetic archi- 
tecture, either single-stage or 2-stage approaches to 
pathway testing may provide a powerful testing 
approach. However, this result has not been rigorously 
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Table 1 Percent significant sets at a = 0.05 by number of causal and noncausal genes in the set using modified p 
values 



Pathway testing approach Number of causal and noncausal genes In the set 

(ail sets contain 5 genes) 





0 causai. 


1 causai. 


3 causai. 


5 causal. 




5 noncausal 


4 noncausal 


2 noncausal 


0 noncausal 


Single-stage approacli 










FPCA 


2.5% (5/200) 


6.5% (13/200) 


5.0% (10/200) 


9.0% (18/200) 




3.5% (7/200) 


6.5% (1 3/200) 


8.5% (1 7/200) 


16.5% (33/200) 


f 


5.0% (10/200) 


3.0% (6/200) 


1.5% (3/200) 


2.5% (5/200) 


CMC5% 


1.5% (3/200) 


0.0% (0/200) 


2.5% (5/200) 


1.5% (3/200) 


CMCo.50/. 


3.5% (7/200) 


5.5% (11/200) 


9.5% (19/200) 


8.0% (16/200) 


Two-stage approacli 










FPCA 


2.0% (4/200) 


1.5% (3/200) 


9.0% (18/200) 


5.0% (10/200) 


Fishers y^. 


3.5% (7/200) 


4.0% (8/200) 


8.5% (17/200) 


11.0% (22/100) 


Fishers_ 


1.5% (3/200) 


3.5% (7/200) 


2.0% (4/200) 


1.0% (2/200) 


Fishers_ CMC5% 


3.0% (6/200) 


2.0% (4/200) 


3.5% (7/200) 


1.0% (2/200) 


Fishers_ CMCo,5% 


2.0% (4/200) 


2.0% (4/200) 


3.5% (7/200) 


0.5% (1/200) 



established across the large class of pathway testing 
approaches, especially rare variant pedigree data. In this 
article, we considered both 1-stage and 2-stage 
approaches to pathway analysis by applying methods 
proposed by Zhu and Xiong to very large multi-SNV 
sets (1-stage), or a 2-stage approach using Fisher's com- 
bined probability test in conjunction with the methods 
of Zhu and Xiong. 

As with most gene-based testing approaches the 
authors of the primary methods considered here only 
evaluated their method on "small" sets of SNVs (189 
SNVs) [9]. However, the assumption of many authors of 
gene-based testing approaches is that the methods can be 
applied to large sets of SNVs that may contain SNVs 
from multiple genes (see, eg, Madsen and Browning [3]). 
This is both a reasonable and natural assumption, 
although one that has rarely been considered explicitly in 
the literature. 

In our consideration of 1-stage approaches, we found 
a substantially inflated type I error rate, which increased 
as the number of SNVs increased. Further analysis pin- 
pointed the issue to the use of MLEs when estimating 
the genotype covariance matrix (as in the Zhu and 
Xiong code). Schafer and Strimmer [10] showed that a 
shrinkage estimator provides robust covariance esti- 
mates as the number of SNVs increases relative to the 
sample size. When we implemented the shrinkage esti- 
mator, the results were overly conservative (empirical 
type I error rate was substantially less than the nominal 
rate) when using the null distribution for MLEs from 
Zhu and Xiong. This pattern of findings also held true 
for 2-stage approaches when any genes in the set were 
large (containing more than 200 SNVs). To address 
these limitations, we applied an empirical correction 



factor to the p values, which allowed us to examine per- 
formance of the 10 methods. 

Conclusions 

Further research is necessary to develop alternative cov- 
ariance estimation procedures and corresponding null 
distributions for large (more than 200 SNVs) sets. We 
note that, in both a companion paper [7] and here, use of 
the MLE estimation approach yields well-controlled type 
I error rates for SNV sets with less than 200 SNVs. 
Permutation approaches to control type I error should 
also be explored, however, given the complex pedigree 
structure present in the data will require gene dropping 
or related approaches that may limit their practical 
utility. 

Assuming that appropriate finite sample null distribu- 
tions can be derived when applying SNV-set methods to 
very large sets of SNVs, the true underlying genetic disease 
architecture will play a significant role in determining 
which statistical methods will perform best in practice. A 
recent article evaluating the differences in rare variant 
tests of association illustrated that for sets of SNVs where 
the proportion of noncausal SNVs is very large, as will 
likely be the case in 1-stage approaches for rare variants, 
tests like the minimum ^^.^ (equivalent to an L„.o norm) 
will perform better than 7^-based (L2 or Li norm) type 
tests (see Liu et al [6] for details) by empirically upweight- 
ing the strongest SNV-phenotype associations. 

Few methods exist for the analysis of potential relation- 
ships between binary phenotypes and rare genetic varia- 
tion; fewer still have considered how such methods will 
perform on very large sets of SNVs that may span multi- 
ple genes. Our analysis identified substantial inflation of 
the type I error rate using a standard approach, and so 
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we implemented an empirical approach to evaluate the 
relative performance of different methods. Further 
research is necessary to explore robust test statistics and 
analytic strategies for large SNV sets in complex pedi- 
grees across a wide variety of genetic architectures. 
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